Currently it’s hardly possible to perform serious scientific experiments without some sort of help from computers. They are needed both to control experimental setups and to analyze data. We will see in a series of posts how easy it is to start doing data analysis with Python, even for those without previous programming knowledge. The main idea of this tutorial is to allow a fast start in operating Python and understanding how it can be used in scientific experiments. Some aspects can be difficult at the beginning, but a general recommendation is always valid: when in doubt, google it. There is a lot of information available on the web, especially for Python. The following content aims to discuss the most basic aspects necessary to use Python in basic data analysis, while a more extended text, such as a book, would be an essential companion. Personally I like the book A Primer on Scientific Programming with Python, by Hans Petter Langtangen. Another great reference is the official Python Tutorial, by Guido van Rossum (Python’s creator).
The first step is to install Python 3. There are several ways to do that, and Anaconda is one that is available on all platforms (Windows, Linux and Mac). There are also other Python distributions (e.g. WinPython for Windows), or Python can be installed through a Linux package manager. These alternatives will not be discussed here, but they are easily found on the web. Given that I use Windows, this series will consider Anaconda installed on this platform. However, the adjustments needed to use Linux or Mac should be minimal. In particular, the installer may ask at some step if you want to register your Python distribution. While not obligatory, this is sometimes very convenient. It is often disabled by default so that several distinct Python installations can coexist on the same machine, but that is probably of no use to us in what follows. Therefore, it’s suggested to register your Python distribution during installation.
The Jupyter notebook can be accessed from the system console if you’ve registered your Python installation. The registration is offered during installation and is optional, so it may be disabled. If the procedures below don’t work, you should be able to find Jupyter in your operating system’s start menu.
In Windows, it’s useful to create a folder to contain the files that you’ll be working with (e.g. My Documents\Jupyter Notebooks). Open that folder with Windows Explorer and access the menu File > Open Windows PowerShell > Open Windows PowerShell.
To open the Jupyter notebook, type in the PowerShell
jupyter notebook
Then, wait for the web browser to open the initial page of the Jupyter notebook. At the upper right corner there is a button to create a new notebook. Select New > Python 3 notebook. Now everything is set and we may start our first Python program.
By default, the new notebook opens with an input field (a cell) suited for code input (for other input types see the dropdown menu currently stating Code). Type print("Obey the gravity. It's the law!") in the cell and press Shift + Enter on your keyboard. The output must be
Obey the gravity. It's the law!
This is a small program that asks Python to display a text (more correctly, a string) containing a given message. Python understands anything between single quotes 'text' or double quotes "text" as text. In the Jupyter notebook, the output of a given cell appears just below the associated input code.
Now you think… well… the above code is nice. But, I want to perform experiments, and what I need is a computer to do heavy and difficult calculations. Python allows you to also do that easily. Consider a sum of two terms:
2+2
Press Shift + Enter to execute the above code and check that the answer is correct. You can also try to improve the output to make your statement stronger. It’s possible to send several arguments to a function by separating the expressions with commas:
print("The sum of 2+2 is", 2+2)
Python is very good at math. Indeed, we can perform several mathematical operations. Sum +, difference -, multiplication * and division / operators are available. Powers can be calculated with ** (e.g. \(2^3\) is written as 2**3), and operations can be grouped with parentheses (1/(1+1) == 0.5, while (1/1) + 1 == 2). Python calculates powers first, then multiplications and divisions, and finally sums and differences (see operator precedence). Thus, when in doubt it’s good practice to separate multiplication and division terms with parentheses. We can also use variables to store information (numbers, texts and even more complex structures lying ahead). Suppose that we want to consider the parabolic trajectory of a mass due to the action of gravity. We may define the variables y0, v0 and a to store the initial information of the mass launch (initial height, initial velocity and acceleration) as follows:
y0 = 0.0 # m
v0 = 20.0 # m/s
a = -9.8 # m/s**2
Whenever we refer later to y0
, v0
or a
we can retrieve these values from the computer memory. In this way we can, for example, write down mathematical formulas involving y0
, v0
and a
and let the computer remember which number is associated with each quantity. This is interesting because it allows us to do and redo a given set of calculations just by changing these initial values. We don’t have to rewrite each step of the calculation, since we will later describe in Python which calculations we want the computer to perform.
Notice that each instruction occupies one line. In the previous cell we had 3 instructions which only stored numbers in the memory. The units, as m/s, are not stored. Therefore, we must be careful with the units of each variable, or be sure that the adequate conversions are considered at the beginning or at a later stage. To keep track of the units it’s useful to add comments at the end of the line. Python will ignore everything after a #
symbol, and you may add any text. Comments are useful to understand what the program is doing at each step. Also notice that the decimal separator in python is the period .
, and not the comma ,
. Commas are used to separate expressions, such as function arguments.
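The points above about precedence, comments and the decimal separator can be checked in a single cell. This is only a small sketch: the variable names are made up for the illustration, and the last assignment relies on a tuple, a Python object that is not discussed in this text.

```python
# Operator precedence: powers first, then * and /, then + and -
without_parentheses = 1 / 1 + 1      # read as (1/1) + 1
with_parentheses = 1 / (1 + 1)       # read as 1/2
power_first = 2 * 3**2               # read as 2 * 9, not 6**2

# The period is the decimal separator; a comma separates expressions
half = 0.5
not_a_number = 0,5                   # the pair (0, 5), not one half

print(without_parentheses, with_parentheses, power_first)
print(half, not_a_number)
```

Notice that typing 0,5 does not raise an error here: Python silently stores the two numbers 0 and 5 together, which is why this mistake can be hard to spot.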
Once all variables are available in memory, you may calculate the mass height at a given time using the known formula \(y(t) = y_0+ v_0 t+\frac{at^2}{2}\). For example, to obtain the height at \(t=3\,\text{s}\) we ask Python to calculate
y0 + v0 * 3 + a * 3**2 / 2 # Calculate the mass height at t=3 s
15.899999999999999
To obtain the particle height at other times, we may replace the time \(t=3\,\text{s}\) by another time of interest. Also, we may change the motion parameters y0, v0, a by updating the variable values. For example, if the particle now has an initial velocity of 15 m/s, we may adjust the initial velocity with the following code:
v0 = 15 # m/s
An important remark is that variable names must be typed in exactly the same way everywhere: v0 and V0 are different variables. The syntax (that is, exactly how you write things) is extremely important in a computer language. The computer is “dumb” and will only follow exact instructions. Any imprecision in the language will produce errors.
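To see what such an error looks like, we can deliberately use the wrong name. The sketch below wraps the mistake in a try/except block, a Python construct not covered in this text, only so that the cell keeps running and we can read the message; without it, Python would stop and display the same complaint.

```python
v0 = 15  # m/s

try:
    print(V0)          # capital V: this name was never defined
except NameError as error:
    message = str(error)
    print("Python complains:", message)
```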
Python variables can store other types of information, besides numbers. For example, they may be used to store text (strings
):
message = "The current initial velocity is"
print(message, v0)
It would be extremely annoying if we had to enter the time everywhere it appears in the trajectory formula whenever we want to determine the mass position. One approach is to let the program do the calculation for us by defining the function \(y(t)\) as a Python function. There are two basic ways to define a function in Python. Through the so-called lambda functions, one has
y = lambda t: y0 + v0 * t + a * t**2 / 2
The general structure of a lambda function is 'function name' = lambda 'arguments': 'python expression'
. To calculate the function value we can simply type y(3)
, y(1.5)
, … Test the output values in a new cell for several values of t
. Notice that the function contains y0
, v0
and a
which are variables stored in memory. Try to change the values of these variables and see how they affect y(t)
. While here we are considering a function of a single argument (\(y=y(t)\)), it’s possible to have multiple arguments too. For example, suppose that we want to consider the height at a time \(t\) for a particle that starts at \(y(0)=0\) and allow the initial velocity to change. We may define a more specific function to this problem
y2 = lambda t, v0y: v0y * t - 9.8 * t**2 / 2
Thus, if we want to consider this second scenario, we may use python to call y2(1, 10)
to know the particle height at t=1
s and with an initial velocity of 10
m/s. Notice the ordering: the first number indicates the value of t
, while the second number indicates the value of v0y
. Another possibility is to use a variable to set the argument values, as in y2(2, v0)
. We may also use y2(3.5, 13.2)
to consider the particle height at 3.5
s given an initial velocity of 13.2
m/s. Notice also that the comma is an argument separator (separates the arguments t
and v0y
), while the period is a decimal separator in python. Thus, if one tries to type a number with the comma as a decimal separator it will lead to an unexpected behavior (may even trigger error messages). For example, y2(3,5, 13,2)
will send four arguments (3
, 5
, 13
and 2
) to y2
, while y2
accepts only two arguments (t
and v0y
). Calling y2(3,5, 13,2)
will lead to an error message.
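The argument rules discussed above can be tried out directly. The sketch below repeats the definition of y2 so that it runs on its own; calling the function with the argument names (keyword arguments) is an extra possibility not mentioned above, and the try/except block is used only to display the error without interrupting the cell.

```python
y2 = lambda t, v0y: v0y * t - 9.8 * t**2 / 2

# Positional call: the first value is t, the second is v0y
by_position = y2(1, 10)

# Keyword call: naming the arguments makes the order irrelevant
by_keyword = y2(v0y=10, t=1)

# Commas used as decimal separators send four arguments to y2
try:
    y2(3,5, 13,2)
except TypeError as error:
    print("Error:", error)

print(by_position, by_keyword)
```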
Sometimes the function definition through lambda functions as above is not very adequate, because Python functions are more general than mathematical functions and can also be used to request the computer to perform a more complex set of tasks. For instance, suppose that we want to know simultaneously both the particle height and velocity with a single function. We may define a function as
def parabolic_motion(t, v0y):
    yt = v0y * t - 9.8 * t**2 / 2
    vt = v0y - 9.8 * t
    print('at t=', t, ', y(t)=', yt,', v(t)=', vt)
    return yt, vt
# Function ends where indentation ends
final_height, final_velocity = parabolic_motion(5, 15)
Test the above code. After the line def ...
there are several instructions, which must be indented (that is, each line starts with two spaces or more, or a TAB key press). The same indentation that you use in the first line must be used in the following ones within the function. When the indentation stops it means that the instructions that should be performed by the function have ended. Given a time t
and initial velocity v0y
, it will compute the height and store it in yt
, then it will compute the particle velocity and store it in vt
. Then, it will print to the user a string informing the current time, height and velocity. print
accepts several arguments, where some of them are strings (text between quotes) and the others are variables. Always remember that commas are used to separate expressions or arguments. Finally the return
instruction will send the values of yt
and vt
as the function output. Thus, the above steps will be carried out sequentially and the function will return two numbers that can be assigned to other variables. For instance, final_height, final_velocity = parabolic_motion(5, 15)
will set final_height=-47.5
and final_velocity=-34
. The general structure is
def 'function name'('arguments'):
    'indented instructions'
    ...
    return output
# Function 'function name' ends here.
# The program instructions follow unindented below
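As an illustration of this general structure, here is a slightly extended variant of the function above. The name parabolic_motion_general, the optional arguments y0 and g and the short description between triple quotes are assumptions made for this sketch, and 1.62 m/s**2 is only an approximate value for the gravity on the Moon.

```python
def parabolic_motion_general(t, v0y, y0=0.0, g=9.8):
    """Return height (m) and velocity (m/s) at time t (s).

    y0 and g are optional: they default to 0.0 m and 9.8 m/s**2.
    """
    yt = y0 + v0y * t - g * t**2 / 2
    vt = v0y - g * t
    return yt, vt


# Only t and v0y are given, so the default y0 and g are used
height, velocity = parabolic_motion_general(2, 15)
# Same launch with a weaker gravity, passed by name
height_moon, velocity_moon = parabolic_motion_general(2, 15, g=1.62)

print(height, velocity)
print(height_moon, velocity_moon)
```

Arguments with a default value may be omitted in the call, which keeps the common case short while still allowing other scenarios.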
The previous concepts can be used to define some simple mathematical functions, such as polynomials. However, we often need a broader set of functions in our calculations. For instance, we may need trigonometric, hyperbolic, exponential or logarithmic functions. Also, we may need to use constants such as \(\pi\) or \(e\). There are basically two ways to introduce such functions in a Python program. The first one is through the native math
library. In programming, a library represents a set of functions (or objects) that solve some set of problems. To use a given library, we may use the import 'library'
instruction. To call a given library function or constant, it’s possible to use a syntax like 'library'.'function'
, or 'library'.'constant'
.
import math
print(math.sin(math.pi/2))
1.0
To consult the available constants, functions and other python objects, one may consult the math library documentation, or use some autocompletion features in the editor. For instance, if you type a period after math, math.
, and press the TAB
key, then the interface will display a menu containing the available properties. For instance, math.si
+tab
shows math.sin
and math.sinh
as options. Since these are functions, we may add the parentheses to begin the function call. To see how these functions operate, it’s possible to use the Jupyter notebook’s tooltip feature. Type math.sin( and press Shift+Tab. Documentation on the syntax and usage of that function will be readily available.
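As a slightly longer use of the math library, the sketch below splits a launch velocity into its horizontal and vertical components. The launch speed and angle are made-up values for the illustration; math.radians, math.cos, math.sin and math.hypot are functions of the standard math library.

```python
import math

v0 = 20.0                      # m/s, launch speed (illustrative value)
angle = math.radians(30)       # trigonometric functions expect radians

v0x = v0 * math.cos(angle)     # horizontal component
v0y = v0 * math.sin(angle)     # vertical component

# Recombining the components must give back the launch speed
speed = math.hypot(v0x, v0y)

print(v0x, v0y, speed)
```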
While the above possibility is very general and can be used to import other libraries, the Jupyter notebook offers magic commands that greatly simplify the syntax of complex mathematical expressions. One of the most convenient is the %pylab magic, which I recommend calling only once, in the first cell of a notebook. Type in some cell
%pylab inline
The inline option is related to the tool that we will use later to plot curves. After the %pylab inline magic is called, we may calculate the sine more concisely:
print(sin(pi/2))
1.0
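The names sin and pi made available by %pylab come from the Numpy library, which is discussed further below. If you ever run the code outside a notebook, where magics are not available, one possible alternative is to import Numpy explicitly. This is a sketch assuming Numpy is installed (it comes with Anaconda); np is just the conventional short name.

```python
import numpy as np

# With an explicit import the library name is always visible
print(np.sin(np.pi / 2))

# Individual names can also be imported to keep the formulas short
from numpy import sin, pi
value = sin(pi / 2)
print(value)
```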
Python is also able to handle complex numbers natively. The imaginary unit is written 1j, and one can easily obtain results such as Euler’s identity: \(e^{\jmath \pi} + 1 = 0\) is verified within the calculation precision,
exp(1j * pi) + 1
1.2246467991473532e-16j
A remark on numeric notation in Python: in a number, e indicates multiplication by a power of 10. Thus 1e1==10, 1e2==100, -1.5e-2==-0.015. The machine epsilon is the difference between 1.0 and the next larger number that the computer can represent, and it can be obtained from sys.float_info.epsilon (after import sys). My machine’s epsilon is 2.220446049250313e-16. Since this sets the relative precision of the stored numbers, calculations that mix numbers of very different magnitudes lose accuracy, and the expressions should be normalized. Numerical calculations may also involve infinity (for example, with the %pylab functions log(0) returns -inf, that is -float('inf'), together with a warning), while another important symbol is not a number, NaN, which is returned for undefined operations such as \(\infty - \infty\): float('inf') - float('inf') evaluates to nan. Beware that nan is never equal to anything, not even to itself, so it must be tested with isnan instead of ==.
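These limits are easy to observe in practice. The following sketch uses only the standard sys and math libraries; math.isclose and math.isnan are the standard-library functions for comparing with a tolerance and for detecting NaN.

```python
import math
import sys

eps = sys.float_info.epsilon

# eps is the gap between 1.0 and the next representable number
print(1.0 + eps > 1.0)        # the sum can be told apart from 1.0
print(1.0 + eps / 2 == 1.0)   # half of eps is lost in the rounding

# 0.1 + 0.2 is not exactly 0.3, so compare with a tolerance instead
print(0.1 + 0.2 == 0.3, math.isclose(0.1 + 0.2, 0.3))

# nan is never equal to anything, not even to itself
undefined = float('inf') - float('inf')
print(undefined == undefined, math.isnan(undefined))
```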
When using complex numbers, there are several auxiliary functions available. For example, if we have c = sqrt(3)/2 + 1j * 1/2
, we may compute the real and imaginary parts of c
simply from real(c)
and imag(c)
. The absolute value of c
is abs(c)
, while the angle (in radians) between c
and the real axis in the complex plane is angle(c)
. This angle is limited to the domain \([-\pi,\pi]\).
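The same quantities can also be obtained without %pylab. The sketch below is an alternative that relies on the real and imag attributes of Python complex numbers and on the standard cmath library, whose phase function plays the role of angle.

```python
import cmath
import math

c = math.sqrt(3) / 2 + 1j * 1 / 2

real_part = c.real            # attributes: no parentheses needed
imag_part = c.imag
modulus = abs(c)              # absolute value
phase = cmath.phase(c)        # angle in radians

# For this particular c the angle should be pi/6 (30 degrees)
print(real_part, imag_part, modulus, phase, math.pi / 6)
```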
The %pylab
magic does not use the standard Python math library. Instead, it uses the Numpy library, which has many more capabilities. Numpy contains a large set of tools to handle numerical data, from matrix operations to fast Fourier transforms. It’s worth mentioning that Python is a great language to code with, because it’s fast to create and test pieces of code. However, this convenience comes at a cost in processing time when Python code is used directly. Numpy implements several important numerical procedures using other languages that process data much faster than native Python. When crunching numbers, use Numpy/Scipy methods as much as possible. Scipy is another library, which contains several applications of Numpy to a multitude of science-related problems.
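To get a feeling for this difference, we can time the same calculation done in two ways. This is only a rough sketch: the for loop (a Python construct not covered in this text) handles one number at a time, the array size n is arbitrary, and the measured times depend on the machine, so no specific figures are claimed here.

```python
import time
import numpy as np

n = 200_000
values = np.linspace(0, 1, n)

# Pure Python: add the squares one element at a time
start = time.perf_counter()
total_loop = 0.0
for v in values:
    total_loop += v**2
loop_seconds = time.perf_counter() - start

# Numpy: square and sum the whole array at once
start = time.perf_counter()
total_numpy = np.sum(values**2)
numpy_seconds = time.perf_counter() - start

print("loop :", total_loop, loop_seconds, "s")
print("numpy:", total_numpy, numpy_seconds, "s")
```

Both approaches give the same sum up to rounding, so the comparison is only about the time spent.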
Quite often one has to deal with functions defined within a certain range of variable values, a finite domain. The Numpy library has two main auxiliary functions that allow us to define this domain for mathematical functions, arange
and linspace
. arange([start], stop, [step])
creates an array of numbers starting from start
(which is an optional argument, 0
by default) to stop
, in equally separated step
steps (optional, step=1
if not given). Notice that arange
does not include the stop
as a value in the array. This is sometimes useful, but must always be remembered. In the code below, x starts at 2 and increases in steps of 2 while the value remains less than stop=9. This produces an array containing the even numbers from 2 to 8. y contains the non-negative integers less than stop=10 (remember that the default step is 1). It’s also possible to produce an array of fractional values with a given spacing, as in z.
x = arange(2, 9, 2)
y = arange(10)
z = arange(1, step=0.1)
x, y, z
(array([2, 4, 6, 8]),
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]))
While arange
does not include the final point and one must define the spacing between values, linspace(start, stop, num)
takes another approach. It will create an array of values between start
and stop
. The endpoint is included by default. num
indicates how many points the array must have between start
and stop
, and is quite useful to increase or decrease the resolution with which a given function is represented in a given domain. Defining a step, as in arange
can be less intuitive when we want to represent functions in different domains. linspace
is more adequate for fractional spacing than arange
.
x = linspace(0, 1, 10)
y = linspace(0, 10, 3)
z = linspace(0, 10, 5)
x, y, z
(array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ]),
array([ 0., 5., 10.]),
array([ 0. , 2.5, 5. , 7.5, 10. ]))
Notice that now x
contains an array of 10
values and includes the endpoints. Considering that the number of intervals is num-1
, the spacing between subsequent points is (stop-start)/(num-1)
. Notice num
represents the number of points in the array, so to obtain intervals with a given regular spacing, such as w=(stop-start)/divisor
, one must use num = divisor + 1
.
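The relation between num and the spacing can be verified with the optional retstep argument of linspace, which makes it also return the spacing that was used. The names start, stop and divisor follow the text above, while the chosen values are just an example; Numpy is imported explicitly so that the sketch also runs without %pylab.

```python
import numpy as np

start, stop, divisor = 0.0, 1.0, 10

# divisor intervals require divisor + 1 points
points, spacing = np.linspace(start, stop, divisor + 1, retstep=True)

print(len(points), spacing)
print(points)
```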
The outputs of arange and linspace are array objects, the basic way that Numpy uses to handle sets of numbers. If we want to define arrays over a given set of values, we may also use the following syntax
a = array([1,2,3,5,7,11,13])
b = array([0.1, 0.3, 0, -52, 42])
c = array([ 1, 1+2j, -3, -1-1j])
a, b, c
(array([ 1, 2, 3, 5, 7, 11, 13]),
array([ 0.1, 0.3, 0. , -52. , 42. ]),
array([ 1.+0.j, 1.+2.j, -3.+0.j, -1.-1.j]))
The set of numbers separated by commas within square brackets [1, 2, ...,]
is a Python list, which is a python object that can be used to store ordered information. Beware that the order in which the data is added is retained. Also, notice that if one adds complex numbers the array automatically considers that all numbers in the array are complex. Each element in the list is associated with an index used to locate such element. To access the elements of the list x = ['a', 'b', 42, [-5, 1j] ]
we can use x[i]
, where i
ranges from 0
to 3
. In particular, x[0]=='a'
, x[1]=='b'
, x[2]==42
and x[3]==[-5, 1j]
.
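A few more ways to access list elements are sketched below. Negative indices, nested indexing and the len function are standard Python features that go slightly beyond what was described above.

```python
x = ['a', 'b', 42, [-5, 1j]]

first = x[0]
last = x[-1]          # negative indices count from the end
inner = x[3][1]       # an element of the list stored inside x
size = len(x)         # number of elements; valid indices are 0 .. size-1

print(first, last, inner, size)
```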
A great aspect of arrays is that one may easily perform calculations over them. For instance, we may create an array with even numbers from 0 to 10 using an arange
. arange(6)
creates the array [0, 1, 2, 3, 4, 5]
, while the multiplication by 2
multiplies each value individually.
2*arange(6)
All of the basic operations (+, -, *, /, **
) and the Numpy functions can be calculated over each value in the array. Thus, consider that we want to calculate the values of a quadratic function, \(y(x)= 2 + 15 x - 5 x^2\). The Numpy array is such that if we have an array x
and we call y = 2 + 15 * x - 5 * x**2
, then y
will be another array that contains the result of the calculation of the right-hand side expression for each value of x
. Numpy will get each value of x
in the array, evaluate the mathematical expression, and return the output value to y
at the corresponding x
value position. An important aspect is that Numpy evaluates the native mathematical functions and expressions quite efficiently, and since the python expression can look like the mathematical function that we want to calculate, it’s easy to check for errors.
x = linspace(0, 5, 11)
y = 2 + 15 * x - 5 * x**2
x, y
(array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
array([ 2. , 8.25, 12. , 13.25, 12. , 8.25, 2. , -6.75,
-18. , -31.75, -48. ]))
The above code represents the set of (x, y) values that satisfy the relationship expected for the function \(y(x)\). However, it would be much nicer if we could see this in a graphical representation, as we’ll see just below. To finish this section, we may notice that we may use functions to calculate these results.
x = linspace(0, 5, 11)
y = lambda t: 2 + 15 * t - 5 * t**2
x, y(x)
(array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
array([ 2. , 8.25, 12. , 13.25, 12. , 8.25, 2. , -6.75,
-18. , -31.75, -48. ]))
Notice that y
is a lambda
function that has a generic argument t
, instead of x
directly. An advantage of the function definition above is that now it is possible to evaluate the function y(t)
over distinct Numpy arrays, and we don’t have to rewrite everything. For example, suppose that somewhere else we want to evaluate y
over a larger domain z
and at a higher resolution. We may use simply
z = linspace(-10, 10, 100)
z, y(z)
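Once the function values are stored in an array, simple questions about them can be answered directly. As an example, the sketch below looks for the largest calculated value of the parabola and the position where it occurs, using argmax, a Numpy function not mentioned above. Numpy is imported explicitly so that the code also runs without %pylab.

```python
import numpy as np

y = lambda t: 2 + 15 * t - 5 * t**2

x = np.linspace(0, 5, 11)
values = y(x)

i_max = np.argmax(values)      # index of the largest value in the array
x_max = x[i_max]
y_max = values[i_max]

print("maximum:", y_max, "at x =", x_max)
```

Keep in mind that this only inspects the calculated points: with a coarse array the true maximum of a function may fall between two of them.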
Obtaining a list of calculated values is very important, since the computer can calculate very fast and efficiently. However, it’s often easier to understand data if we can visualize it in a plot. The matplotlib package contains several tools to produce high-quality data plots (both in 2D and 3D). Another interesting aspect is that the matplotlib documentation contains many examples that are useful in several data plotting contexts. Here we will only see some of the most basic capabilities.
In the previous section, we calculated \(y(x)= 2 + 15 x - 5 x^2\) for \(x\in \left[0, 5\right]\). This function can be graphically represented with the simple command plot(x, y)
x = linspace(0, 5, 11)
y = 2 + 15 * x - 5 * x**2
plot(x, y)
savefig(r'assets\pes_parabola_plot1.png')
The function savefig(r'assets\pes_parabola_plot1.png')
saves the current matplotlib graphic in the external file pes_parabola_plot1.png
located in the folder assets
. The folder must exist before the file is written! Several output graphics file formats are supported, and this is useful to include the analyzed data in other documents (for example in Word, LaTeX,…). Notice that the above plot has some sharp corners, which are due to the small number of points used to represent \(y(x)\). We can improve the graphic resolution by changing the number of points in the x
array:
x = linspace(0, 5, 101)
y = 2 + 15 * x - 5 * x**2
plot(x, y)
savefig(r'assets\pes_parabola_plot2.png')
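Since savefig fails when the target folder is missing, it can be convenient to let Python create the folder first. The sketch below is one possible way to do it with the standard os library. It uses explicit imports instead of %pylab, a temporary folder stands in for the assets folder, the file name parabola_example.png is made up for this example, and the matplotlib.use("Agg") line is only needed when running outside the notebook on a machine without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")              # draw to files only, no window needed
import matplotlib.pyplot as plt
import numpy as np

# A temporary folder stands in for the tutorial's "assets" folder
base_folder = tempfile.mkdtemp()
folder = os.path.join(base_folder, "assets")
os.makedirs(folder, exist_ok=True)  # create the folder if it is missing

x = np.linspace(0, 5, 101)
y = 2 + 15 * x - 5 * x**2

plt.plot(x, y)
filename = os.path.join(folder, "parabola_example.png")
plt.savefig(filename)
plt.close()

print("saved:", os.path.exists(filename))
```

Building the path with os.path.join also avoids typing the folder separator by hand, which differs between Windows and Linux or Mac.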
A typical scientific plot has several features, such as a title, axis labels and maybe a legend, to indicate what is represented by each curve. The example below shows how to draw several curves and add some of these features
# Parabola representation with a low resolution
x = linspace(0, 5, 5)
y = 2 + 15 * x - 5 * x**2
plot(x, y, '--', label="low res.")
plot(x, y, '.', label="low res. points")
# Parabola representation with a high resolution
x = linspace(0, 5, 101)
y = 2 + 15 * x - 5 * x**2
plot(x, y, '-', label="high res.")
# Plot properties
title('$y(x)= 2 + 15 x - 5 x^2$')
xlabel('x (m)')
ylabel('y (m)')
# Use legends
legend()
# Use grid
grid()
savefig(r'assets\pes_parabola_plot3.png')
The plot
function accepts several arguments, and its complete description is given in the documentation. The string after the x
and y
data is used to set the curve format properties (color, line type, markers…). In the above example we drew a dashed line (--), point markers (.) and a solid line (-, the default); a dotted line would be requested with a colon (:). It is also possible to associate a symbol to each pair of x,y
data coordinates, some common ones being o, *, +, x
that respectively generate the symbols \(\bullet, *, +, \times\). The color changes automatically, but can also be selected in the format string. To identify each curve in the plot, we label each curve. These labels only appear when legend()
is called. The location of legend
can be set by using the argument loc=
(See documentation). title
, xlabel
and ylabel
are used to set the plot title and horizontal and vertical axis labels. A nice feature is that matplotlib accepts LaTeX typeset formulas in its labels (check the LaTeX wikibook).
x = linspace(0, 5, 101)
y = pi + 15 * x - 5 * x**2
plot(x, y)
title(r'$y(x)=\pi+\int_0^x \left(15 - \frac{20}{2} t\right) dt$')
xlabel('x (m)')
ylabel('y (m)')
savefig(r'assets\pes_parabola_plot4.png')
In the above plot the title contains terms such as the integral operator \(\int\rightarrow\)\int
and the Greek letter \(\pi\rightarrow\)\pi
. LaTeX formulas must be within a string and enclosed by dollar signs $
. Quite often these strings can be misinterpreted by python, given that \
is used in strings to describe some special characters. Thus, when typing latex formulas one must define raw
strings by adding the character r immediately before quotes in the formula-containing string. Quite often we want to write sub or superscripts in our expressions, and the Latex notation is very simple. \(x_1\) can be written as $x_1$
, or a more general subscript must be enclosed between curly brackets, \(\psi_{1,2,3}\rightarrow\)$\psi_{1,2,3}$
. The superscripts are similar, typical examples being \(e^{-x^2}\rightarrow\)$e^{-x^2}$
or \(2^{2^2}\rightarrow\)$2^{2^2}$
.
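The effect of the r prefix can be seen without plotting anything. In the sketch below the same text is typed with and without the prefix; the formula \theta is just an example, chosen because \t is one of the special characters (a TAB) that Python recognizes inside ordinary strings.

```python
plain = '$\theta$'     # "\t" is read as a single TAB character
raw = r'$\theta$'      # the backslash is kept exactly as typed

print(len(plain), len(raw))
print('\t' in plain, '\\' in raw)
```

The plain string is one character shorter because the backslash and the letter t were merged into a TAB, so matplotlib would never receive the \theta command.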
To finish this section, we consider a few more elaborate functions. Notice that the same x
array is used to represent all functions.
x = linspace(0, 10, 201)
y = 3 * sin(6*x) * exp(-x)
plot(x, y, label=r'$3 \sin(6x) e^{-x}$')
y = - 0.9 * sqrt(sin(x)**2 + cos(x)**2)
plot(x, y, ':', label=r'$- 0.9 \sqrt{\sin^2(x) + \cos^2(x)}$')
y = abs(x - 5)
plot(x, y, '--', label=r'$|x-5|$')
y = log(2*x + 1)
plot(x, y, '-.', label=r'$\log\left(2x + 1\right)$')
title(r'Several mathematical functions')
legend()
xlabel('x')
ylabel('y')
savefig(r'assets\pes_plots_example.png')
In the present tutorial we have seen how to install Python and the Jupyter notebook, how to use Python to calculate simple expressions, a brief introduction to python variables and functions, Numpy mathematical tools and how to use matplotlib to plot Numpy arrays and numerical functions. In the following part we will see how to analyze experimental data in Jupyter notebook.